feat(connect): add phase 3 metadata reporting #2081
sayedbilalbari wants to merge 20 commits into NVIDIA:dev
Conversation
Prep for Phase 3 reporting — indexes get populated from SQLExecutionStart.jobTags and JobStart.spark.job.tags. Issue NVIDIA#2065.
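A minimal sketch of what that population step implies, assuming illustrative signatures (the index names follow this PR's description, not necessarily the merged code):

```scala
import scala.collection.mutable

// Assumed index shapes: jobTagToConnectOpId is filled by the Connect
// event handler; operationIdToJobIds is the index this commit populates.
val jobTagToConnectOpId = mutable.HashMap[String, String]()
val operationIdToJobIds = mutable.HashMap[String, mutable.HashSet[Int]]()

def onJobStart(jobId: Int, props: java.util.Properties): Unit = {
  val rawTags = Option(props.getProperty("spark.job.tags")).getOrElse("")
  rawTags.split(",").map(_.trim).filter(_.nonEmpty).foreach { tag =>
    // Only tags previously registered by a Connect operation resolve to an opId.
    jobTagToConnectOpId.get(tag).foreach { opId =>
      operationIdToJobIds
        .getOrElseUpdate(opId, mutable.HashSet[Int]())
        .add(jobId)
    }
  }
}
```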
…bTags
Uses reflective accessor to stay compatible with Spark 3.2-3.4 profiles.
Also tightens operationIdTo{Sql,Job}Ids value type to mutable.HashSet for
consistency with neighboring collections on AppBase. Issue NVIDIA#2065.
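For reference, the reflective read could look roughly like the sketch below; `jobTags` is a real field on Spark 3.5's `SparkListenerSQLExecutionStart`, but this fallback shape is an assumption, not the PR's exact code:

```scala
// SparkListenerSQLExecutionStart.jobTags exists only from Spark 3.5, so
// older profiles resolve it reflectively and degrade to "no tags"
// instead of failing to compile against 3.2-3.4 event classes.
def readJobTagsFromSQLStartEvent(event: AnyRef): Set[String] = {
  try {
    event.getClass.getMethod("jobTags").invoke(event).asInstanceOf[Set[String]]
  } catch {
    case _: NoSuchMethodException => Set.empty[String]
  }
}
```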
Row types for connect_sessions.csv and connect_operations.csv with derived phase durations, status, sqlID/jobID joins, and statement-file metadata. Issue NVIDIA#2065.
…rom profiler
Per-application output. The file is absent (not empty) when the app is not in Connect mode, matching the behavior of every other per-app table. Issue NVIDIA#2065.
Keeps large protobuf-text plans out of connect_operations.csv. Files land under connect_statements/<operationId>.txt, with the basename referenced in the statementFile column of connect_operations.csv. The directory is not created when there are no operations with non-empty statementText. Issue NVIDIA#2065. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
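A sketch of the sanitize-and-contain step for those sidecar paths, assuming a simple character allowlist (the real sanitizer may differ):

```scala
import org.apache.hadoop.fs.Path

// Map arbitrary operation IDs onto a filesystem-safe basename, then
// verify the resolved target cannot escape connect_statements/.
def sanitizeOperationId(opId: String): String =
  opId.replaceAll("[^A-Za-z0-9._-]", "_")

def resolveSidecarPath(subDirPath: Path, opId: String): Path = {
  val target = new Path(subDirPath, sanitizeOperationId(opId) + ".txt")
  require(target.getParent == subDirPath,
    s"sidecar for $opId would resolve outside $subDirPath")
  target
}
```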
Greptile Summary

This PR adds Spark Connect phase 3 metadata reporting to both the profiler and qualification tools, correlating Connect sessions/operations with SQL executions and Spark jobs, persisting lifecycle metadata as per-app CSVs, and optionally writing statement payloads as sidecar files.

Confidence Score: 5/5

Safe to merge; no blocking issues found after a full review of all changed files. All previously flagged P1 concerns (Hadoop filesystem bypass, path traversal in the Python reader, O(n×m) operation count, missing empty-token filter) are addressed in the current diff. No new P1 or P0 issues were identified. The implementation is consistent with existing patterns, column counts match between the Scala headers and YAML catalogs, and the test suite covers correctness, sanitization, and golden roundtrips. No files require special attention.
Sequence Diagram

```mermaid
sequenceDiagram
participant EL as Event Log
participant EPB as EventProcessorBase
participant AB as AppBase
participant CEH as ConnectEventHandler
participant P as Profiler/QualRawReportGenerator
participant CSW as ConnectStatementWriter
participant FS as Hadoop FileSystem
EL->>CEH: ConnectOperation/Session events
CEH->>AB: connectSessions/connectOperations.put(...)
CEH->>AB: jobTagToConnectOpId.put(jobTag → opId)
EL->>EPB: SparkListenerSQLExecutionStart
EPB->>AB: isConnectMode?
AB-->>EPB: true
EPB->>EPB: readJobTagsFromSQLStartEvent (reflection, Spark 3.5+)
EPB->>AB: operationIdToSqlIds.getOrElseUpdate(opId).add(sqlId)
EL->>EPB: SparkListenerJobStart
EPB->>AB: isConnectMode?
AB-->>EPB: true
EPB->>AB: operationIdToJobIds.getOrElseUpdate(opId).add(jobId)
P->>P: writeConnectTables(writer, app, writeStatements, hadoopConf)
P->>P: groupBy sessionId → opCountBySession
P->>P: build ConnectSessionProfileResult rows
P->>P: build ConnectOperationProfileResult rows
alt writeStatementSidecars = true
P->>CSW: writeStatementFiles(rootDir, ops, hadoopConf)
CSW->>CSW: sanitize operationId → basename
CSW->>CSW: require target.getParent == subDirPath
CSW->>FS: ToolTextFileWriter.write(text)
CSW-->>P: Map[operationId → basename]
end
P->>FS: writeCSVTable(Connect Sessions, sessionRows)
P->>FS: writeCSVTable(Connect Operations, opRows)
```
- ConnectStatementWriter: use Hadoop FileSystem instead of java.nio.file so sidecar writes work for HDFS/S3/GCS, not just local paths
- Profiler.writeConnectTables: precompute an opCountBySession map to avoid an O(sessions × operations) scan when emitting connect_sessions.csv (sketched below)
- EventProcessorBase: filter empty tokens when splitting spark.job.tags
- Thread hadoopConf from Qualification/Profiler through QualRawReportGenerator to the sidecar writer
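A minimal sketch of that precompute, with assumed field names on the operation rows:

```scala
// One pass over all operations builds sessionId -> operation count, so
// each connect_sessions.csv row becomes a constant-time lookup instead
// of rescanning every operation per session.
case class Op(sessionId: String, operationId: String)

def opCountBySession(ops: Seq[Op]): Map[String, Int] =
  ops.groupBy(_.sessionId).map { case (sid, grp) => sid -> grp.size }
```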
@greptile review
- ConnectStatementWriter: route writes through ToolTextFileWriter and ensure a trailing newline (see the sketch after this list), resolving both scalastyle errors flagged by pre-merge CI.
- Fix scaladoc link warnings by using fully qualified names for ConnectSessionInfo / ConnectOperationInfo / OutHeaderRegistry, and drop the @throws doc tag on EventUtils (the annotation is retained).
- Correct a stale sidecar path in the ConnectProfileResults doc.
- Rewrite Connect test suite headers to remove agent-voice / incremental-step framing.
- Tighten the list_connect_statement_ops docstring in result_handler.
- Expose get_table_path / get_per_app_table_path on APIResHandler and cover them with test_connect_helpers tests for wrapper/core handlers.
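Only the newline normalization is sketched here; the `ToolTextFileWriter` wiring itself is an internal API and is left as described above:

```scala
// Normalize sidecar payloads so the written file always ends in a
// newline, which is what the pre-merge scalastyle check required.
def withTrailingNewline(text: String): String =
  if (text.endsWith("\n")) text else text + "\n"

assert(withTrailingNewline("SELECT 1") == "SELECT 1\n")
assert(withTrailingNewline("SELECT 1\n") == "SELECT 1\n")
```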
@greptileai review
```yaml
- label: connectStatements
  description: >-
    Directory of per-operation statementText sidecars. Each
    <operationId>.txt file contains the protobuf debug-format text of the
```
Is there an industry-standard file extension for storing TextProto files?
Very valid point. The canonical protobuf TextFormat extension is .txtpb, which would signal a valid proto output. But in our case the statementText is a diagnostic file that can be truncated, so a .txt file is better: it does not imply a complete protobuf TextFormat file.
LGTM mostly. I had a concern about the version of the Spark JAR used by Tools. In case the Spark Connect JARs are not present, how do we handle the processing?
@parthosa Very valid point, thanks for pointing out. Currently the tools build includes the Spark Connect runtime dependency. FYI, even in case of a missing JAR, the current behavior pops a ClassNotFound error and moves on to the next event.
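A hedged sketch of that fallback (the replay-loop and logging names here are illustrative, not the tools' actual internals):

```scala
// If the Spark Connect event classes are missing from the classpath,
// deserialization throws ClassNotFoundException; the loop reports it
// and continues with the next event instead of aborting the app.
def replayEvents(lines: Iterator[String], parse: String => Unit): Unit = {
  lines.foreach { line =>
    try parse(line)
    catch {
      case e: ClassNotFoundException =>
        Console.err.println(s"Skipping event, missing class: ${e.getMessage}")
    }
  }
}
```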
Fixes #2065
Summary
Adds Spark Connect reporting to profiler and qualification. Correlates Connect operations with SQL executions and Spark jobs, persists session/operation metadata and statement payloads, and exposes them through the Python API. No-ops for event logs without Connect activity.
What's emitted
Per application, under `<perAppDir>/` (profiler) and `<rootDir>/raw_metrics/<appId>/` (qualification):
- `connect_sessions.csv`
- `connect_operations.csv`, with `sqlIds`/`jobIds` joins and a sidecar `statementFile` reference
- `connect_statements/<operationId>.txt`: one `statementText` sidecar per operation (skipped when empty)

Python API
- `ResultHandler` gains `get_connect_statements_dir`, `list_connect_statement_ops`, and `load_connect_statement`, plus generic `get_table_path`/`get_per_app_table_path` accessors.
- `connectReport.yaml` registers the three tables; `ReportTableFormat.DIRECTORY` covers the sidecar directory.

Implementation notes
- `AppBase` carries Connect state (`connectSessions`, `connectOperations`, `operationId → sqlIds/jobIds`, `jobTag → operationId`); `isConnectMode` is set when either sessions or operations are present, so session-only logs still emit output.
- Correlation indexes are populated from `SQLExecutionStart.jobTags` and `JobStart.properties["spark.job.tags"]`.
- `ConnectStatementWriter` routes through `ToolTextFileWriter` for UTF-8 and permission parity with other tool outputs, sanitizes operation IDs, and enforces path-traversal containment.

Testing
- `mvn verify` (Connect suites: `ConnectCorrelationSuite`, `ConnectProfileResultsSuite`, `ConnectProfilerOutputSuite`, `ConnectStatementWriterSuite`, `QualificationConnectOutputSuite`)
- `tox -e prepare,pylint,flake8`
- `pytest user_tools/tests/spark_rapids_tools_ut/api/test_connect_*.py`